Improved Estimation of Class Prior Probabilities through Unlabeled Data
نویسنده
چکیده
Work in the classification literature has shown that in computing a classification function, one need not know the class membership of all observations in the training set; the unlabeled observations still provide information on the marginal distribution of the feature set, and can thus contribute to increased classification accuracy for future observations. The present paper will show that this scheme can also be used for the estimation of class prior probabilities, which would be very useful in applications in which it is difficult or expensive to determine class membership. Both parametric and nonparametric estimators are developed. Asymptotic distributions of the estimators are derived, and it is proven that the use of the unlabeled observations does reduce asymptotic variance. This methodology is also extended to the estimation of subclass probabilities.
منابع مشابه
Estimation of Squared-Loss Mutual Information from Positive and Unlabeled Data
Capturing input-output dependency is an important task in statistical data analysis. Mutual information (MI) is a vital tool for this purpose, but it is known to be sensitive to outliers. To cope with this problem, a squared-loss variant of MI (SMI) was proposed, and its supervised estimator has been developed. On the other hand, in real-world classification problems, it is conceivable that onl...
متن کاملNonparametric semi-supervised learning of class proportions
The problem of developing binary classifiers from positive and unlabeled data is often encountered in machine learning. A common requirement in this setting is to approximate posterior probabilities of positive and negative classes for a previously unseen data point. This problem can be decomposed into two steps: (i) the development of accurate predictors that discriminate between positive and ...
متن کاملClassification on Data with Biased Class Distribution
Labeled data for classification could often be obtained by sampling that restricts or favors choice of certain classes. A classifier trained using such data will be biased, resulting in wrong inference and sub-optimal classification on new data. Given an unlabeled new data set we propose a bootstrap method to estimate its class probabilities by using an estimate of the classifier's accuracy on ...
متن کاملAnalysis of Learning from Positive and Unlabeled Data
Learning a classifier from positive and unlabeled data is an important class of classification problems that are conceivable in many practical applications. In this paper, we first show that this problem can be solved by cost-sensitive learning between positive and unlabeled data. We then show that convex surrogate loss functions such as the hinge loss may lead to a wrong classification boundar...
متن کاملClass Prior Estimation from Positive and Unlabeled Data
We consider the problem of learning a classifier using only positive and unlabeled samples. In this setting, it is known that a classifier can be successfully learned if the class prior is available. However, in practice, the class prior is unknown and thus must be estimated from data. In this paper, we propose a new method to estimate the class prior by partially matching the class-conditional...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1510.01422 شماره
صفحات -
تاریخ انتشار 2015